ABSTRACT

In recent years, various constrained frequent pattern mining problem formulations and associated algorithms have been developed that enable the user to specify itemset-based constraints that better capture the underlying application requirements and characteristics. In this paper we introduce a new class of block constraints that determine the significance of an itemset pattern by considering the dense block that is formed by the pattern's items and its associated set of transactions. Block constraints provide a natural framework in which a number of important problems can be specified, and they make it possible to solve numerous problems on binary and real-valued datasets. However, developing computationally efficient algorithms to find the patterns that satisfy these block constraints poses a number of challenges because, unlike the itemset-based constraints studied earlier, these block constraints are tough: they are neither anti-monotone, monotone, nor convertible. To overcome this problem, we introduce a new class of pruning methods that significantly reduce the overall search space and make it possible to develop computationally efficient block constraint mining algorithms. We present an algorithm called CBMiner that takes advantage of these pruning methods to find the closed itemsets that satisfy the block constraints. Our extensive performance study shows that CBMiner generates a more concise result set and can be one or more orders of magnitude faster than traditional frequent closed itemset mining algorithms.


*This work was supported in part by NSF CCR-9972519, EIA-9986042, ACI-9982274, ACI-0133464, and [...] University of Minnesota; and by the Army High Performance [...] of the Department of the Army, Army Research Laboratory (ARL) under Cooperative Agreement number DAAD19-01-2-0014. The content does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred. Access to research and computing facilities was provided by the Digital Technology Center and the Minnesota Supercomputing Institute.


1.  INTRODUCTION 

Finding frequent patterns in large databases is a fundamental data mining task with extensive applications to many areas including association, correlation, and causality rule discovery, association-rule-based classification, and feature-based clustering. As a result, a vast amount of research has focused on this problem, resulting in the development of numerous efficient algorithms. This research has primarily focused on finding frequent patterns corresponding to itemsets and sequences, but the ubiquitous nature of the problem has also resulted in the development of various algorithms that find frequent spatial, geometric, and topological patterns.

In recent years, researchers have recognized that in many application areas and problem settings frequency is not the best measure for determining the significance of a pattern, as significance depends on a number of other parameters such as the type of items that the pattern contains, the length of the pattern, or various numerical attributes associated with the individual items. In such cases, frequent pattern discovery algorithms can still be used as a pre-processing step to identify a set of candidate patterns that are subsequently pruned by taking into account the additional parameters, but this tends to be inefficient because a large number of the discovered patterns will eventually be eliminated. To address this problem, various constrained frequent pattern mining problem formulations have been developed that enable the user to focus on mining patterns with a rich class of constraints that capture the application semantics. These include algorithms that assign different frequency constraints to the individual items [14, 29] and algorithms that find the itemsets X that satisfy various quantitative constraints of the form avg(X) < v, avg(X) > v, sum(X) < v, or sum(X) > v [21]. The key property of these itemset constraints is that they are usually (or can be converted to) anti-monotone or monotone, making it possible to develop computationally efficient algorithms to find the corresponding patterns.

In this paper we introduce a new class of constraints referred to as block constraints, which determine the significance of an itemset pattern by considering the dense block that is formed by the pattern's items and its associated set of transactions. Specifically, we focus on three different block constraints called block size, block sum, and block similarity. The block size constraint applies to binary datasets, the block sum constraint applies to datasets in which each instance of an item has a non-negative value associated with it that can vary across transactions, and the block similarity constraint applies to datasets in which each transaction corresponds to a vector-space representation of an object and the similarity between these objects is measured by the cosine of their vectors. According to the block size constraint, a pattern is interesting if the size of its dense block (obtained by multiplying the length of the itemset and the number of its supporting transactions) is greater than a user-specified threshold. Analogously, according to the block sum constraint, a pattern is interesting if the sum of the values of its dense block is greater than a user-specified threshold. Finally, according to the block similarity constraint, a pattern is interesting if its dense block accounts for a certain user-specified fraction of the overall similarity between the objects in the entire dataset.

Finding patterns satisfying the above constraints has applications in a number of different areas. For example, in the context of market-basket analysis, the block-size and block-sum constraints can be used to find the itemsets that account for a certain fraction of the overall quantities sold or revenue/profit generated, respectively, whereas in the context of document clustering, the block similarity constraint can be used to identify the set of terms that bring a set of documents together and thus correspond to thematically related words (commonly referred to as micro-concepts [11]).

Developing computationally efficient algorithms to find the patterns that satisfy these block constraints is particularly challenging because, unlike the itemset-based constraints studied earlier, these block constraints are tough [19]: they are neither anti-monotone, monotone, nor convertible. To overcome this problem, we introduce a new class of pruning methods that significantly reduce the overall search space and make it possible to develop computationally efficient block constraint mining algorithms. Specifically, we focus on the problem of finding the closed itemsets satisfying the proposed block constraints and present a projection-based mining framework, called CBMiner, that takes advantage of a matrix-based representation of the dataset. CBMiner pushes the various block constraints deep into closed pattern mining by using three novel classes of pruning methods, called column pruning, row pruning, and matrix pruning, that when combined lead to dramatic performance improvements. We present an extensive experimental evaluation using various datasets that shows that CBMiner not only generates a more concise result set, but is also much faster than traditional frequent closed itemset mining algorithms. Moreover, we present an application in the context of document clustering that illustrates the usefulness of the block similarity constraint in micro-concept discovery.

The rest of the paper is organized as follows. Section 2 introduces some basic definitions and notation. Section 3 formulates the problem and motivates each of the three block constraints. Section 4 describes related work. Section 5 derives the framework for mining closed blocks, while Section 6 discusses in detail how to efficiently mine closed patterns with tough block constraints. The performance study is presented in Section 7. Finally, Section 8 provides some concluding remarks.

2.  DEFINITIONS  AND  NOTATION 


A transaction database is a set of transactions, where each transaction is a 2-tuple containing a transaction id and a set of items. Let $\mathcal{I}$ be the complete set of distinct items and $\mathcal{T}$ be the complete set of transactions. Any non-empty set of items is also called an itemset and any set of transactions is called a transaction set. The frequency of an itemset $X$ (denoted as $\mathrm{freq}(X)$) is the number of transactions that contain all the items in $X$, while the support of $X$ is defined as $\sigma(X) = \mathrm{freq}(X)/|\mathcal{T}|$. For a given minimum support threshold $\theta$ ($0 < \theta < 1$), $X$ is said to be frequent if $\sigma(X) \ge \theta$. A frequent pattern $X$ is called closed if there exists no proper super-pattern of $X$ with the same support as $X$. An itemset constraint $C$ is a predicate on the power set $2^{\mathcal{I}}$, i.e., $C : 2^{\mathcal{I}} \to \{\mathrm{TRUE}, \mathrm{FALSE}\}$. An itemset constraint $C$ is anti-monotone if for any itemset $X$ that satisfies $C$, all the subsets of $X$ also satisfy $C$, and $C$ is monotone if all the supersets of $X$ satisfy $C$. For example, the constraint $\sigma(X) \ge \theta$ is anti-monotone, while $\sigma(X) \le \theta$ is monotone. An itemset constraint is tough if it is neither anti-monotone nor monotone, and cannot be converted to either an anti-monotone or a monotone constraint.
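The definitions of frequency, support, and closedness can be made concrete with a minimal Python sketch. It uses the four-transaction example database that appears in Figure 1(a); the function names (`supporting_set`, `freq`, `support`, `is_frequent`, `is_closed`) are illustrative choices and are not taken from the paper.

```python
# Example database of Figure 1(a): transaction id -> set of items.
DB = {
    1: {"c", "d", "e", "f", "g", "i"},
    2: {"a", "c", "d", "e", "m"},
    3: {"a", "b", "d", "e", "g", "k"},
    4: {"a", "c", "d", "h"},
}

def supporting_set(itemset, db=DB):
    """Ids of the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return {tid for tid, t in db.items() if items <= t}

def freq(itemset, db=DB):
    return len(supporting_set(itemset, db))

def support(itemset, db=DB):
    return freq(itemset, db) / len(db)

def is_frequent(itemset, theta, db=DB):
    return support(itemset, db) >= theta

def is_closed(itemset, db=DB):
    """True if no proper superset has the same support.

    Checking single-item extensions is enough: if a larger superset kept the
    support, so would every one-item extension on the way to it.
    """
    items = set(itemset)
    all_items = set().union(*db.values())
    base = freq(items, db)
    return all(freq(items | {x}, db) < base for x in all_items - items)
```

For instance, `{"a"}` is frequent at a threshold of 0.5 but not closed, because `{"a", "d"}` is supported by the same three transactions.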

A block is defined as a 2-tuple $B = (I, T)$, consisting of an itemset $I$ and a transaction set $T$, such that $T$ is the supporting set of $I$. The size of a block $B$ is defined as $\mathrm{BSize}(B) = |I| \times |T|$. A weighted block is a block $B = (I, T)$ with a weight function $w$ defined on the cross-product of the itemset and transaction set, i.e., $w : 2^{I} \times 2^{T} \to \mathcal{R}^{+}$, where $\mathcal{R}^{+}$ is the set of positive real numbers. The sum of a weighted block $B$ is defined as $\mathrm{BSum}(B) = \sum_{t \in T, i \in I} w(t, i)$.
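As a small illustration of these two quantities, the sketch below builds a block from an itemset and evaluates BSize and BSum. The weighted database `WDB` and its weights are invented for the example, and the helper names are assumptions rather than the paper's notation.

```python
# Hypothetical weighted database: transaction id -> {item: positive weight}.
WDB = {
    1: {"a": 2.0, "b": 1.0, "c": 4.0},
    2: {"a": 1.0, "c": 3.0},
    3: {"b": 5.0, "c": 1.0},
}

def make_block(itemset, wdb=WDB):
    """Return the block (I, T), where T is the supporting set of I."""
    items = frozenset(itemset)
    tids = frozenset(t for t, row in wdb.items() if items <= set(row))
    return items, tids

def bsize(block):
    """BSize(B) = |I| x |T|."""
    items, tids = block
    return len(items) * len(tids)

def bsum(block, wdb=WDB):
    """BSum(B) = sum of w(t, i) over all t in T and i in I."""
    items, tids = block
    return sum(wdb[t][i] for t in tids for i in items)
```

With this data the block of `{"a", "c"}` is supported by transactions 1 and 2, so its size is 4 and its sum is 2 + 4 + 1 + 3 = 10.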

A (weighted) block $B = (I, T)$ is said to be a (weighted) closed block if and only if there exists no other (weighted) block $B' = (I', T')$ such that $I' \supset I$ and $T' = T$. Given a (weighted) block $B = (I, T)$, a (weighted) block $B' = (I', T')$ is a proper superblock of $B$ if $I' \supset I$ and $T' \subseteq T$. In such a case $B$ is called a (weighted) proper subblock of $B'$. We will use $B' \supset B$ to denote that $B'$ is a proper superblock of $B$ and $B \subset B'$ to denote that $B$ is a proper subblock of $B'$.

A block constraint $C$ is a predicate on the power set $2^{\mathcal{I} \times \mathcal{T}}$, i.e., $C : 2^{\mathcal{I} \times \mathcal{T}} \to \{\mathrm{TRUE}, \mathrm{FALSE}\}$. A block $B$ is called a valid block for constraint $C$ if it satisfies constraint $C$ (i.e., $C(B)$ is TRUE). A block constraint $C$ is a tough constraint if there is no dependency between the satisfaction/violation of the constraint by a block and the satisfaction/violation of the constraint by any of its superblocks or subblocks.

A transaction-item matrix $\mathcal{M}$ is a matrix where each row $r$ represents a transaction and each column $c$ represents an item in $\mathcal{T}$ such that the value of the $(r, c)$ entry of the matrix, denoted by $\mathcal{M}(r, c)$, is one iff transaction $r$ supports $c$; otherwise $\mathcal{M}(r, c)$ is zero. Similarly, a weighted transaction-item matrix $\mathcal{M}$ is a transaction-item matrix where for each row $r$ and for each column $c$, $\mathcal{M}(r, c)$ is equal to $w(r, c)$ (where $w$ is a positive weight function defined on all transaction-item pairs in $\mathcal{T}$). A (weighted) block $B = (I, T)$ can be redefined as a (weighted) dense submatrix of the (weighted) transaction-item matrix $\mathcal{M}$ formed with the rows of $T$ and columns of $I$ such that $\forall r \in T$ and $\forall c \in I$ we have $\mathcal{M}(r, c) = 1$ ($\mathcal{M}(r, c) > 0$ in the weighted case).

Given a pre-defined ordering of the columns of $\mathcal{M}$ and a set $p$ of columns in $\mathcal{M}$, a $p$-projected matrix w.r.t. $\mathcal{M}$, $\mathcal{M}|_p$, is defined as the submatrix of $\mathcal{M}$ containing only the rows that support itemset $p$ and the columns that appear after $p$ in matrix $\mathcal{M}$. For any transaction $t$ in $\mathcal{M}|_p$, its size is defined as the number of non-zero elements in its corresponding row of $\mathcal{M}|_p$ and will be denoted by $|t|$. For any column $x$ of $\mathcal{M}|_p$, the matrix obtained by keeping only the rows of $\mathcal{M}|_p$ that contain $x$ is denoted as $\mathcal{M}|_p^x$. For each matrix $\mathcal{M}|_p$ and $\mathcal{M}|_p^x$ we will denote their set of corresponding transactions and items as $\mathcal{T}|_p$, $\mathcal{T}|_p^x$, $\mathcal{I}|_p$, and $\mathcal{I}|_p^x$, respectively.

Given a set of $m$-dimensional vectors $A = \{d_1, d_2, \ldots, d_n\}$, the composite vector of $A$ is denoted by $D$ and is defined to be $D = \sum_{i=1}^{n} d_i$. Given a weighted block $B = (I, T)$, the composite vector of the block is denoted by $B_I$ and is the $|\mathcal{I}|$-dimensional vector obtained as follows. For each item $i \in I$, the $i$th dimension of $B_I$, denoted by $B_I(i)$, is equal to $\sum_{t \in T} w(t, i)$; otherwise, if $i \notin I$, $B_I(i) = 0$. Also, given a $p$-projected matrix $\mathcal{M}|_p$, the composite vector of an item $x$ within $\mathcal{M}|_p$ is denoted by $B_x$ and is the $|\mathcal{I}|$-dimensional vector obtained from the transactions included in $\mathcal{T}|_p^x$ such that for every $i \in \mathcal{I}|_p^x$, $B_x(i) = \sum_{t \in \mathcal{T}|_p^x} w(t, i)$; otherwise, if $i \notin \mathcal{I}|_p^x$, $B_x(i) = 0$.

Given a matrix $\mathcal{M}$, the column-sum of column $i$ in $\mathcal{M}$ is denoted by $\mathrm{csum}_{\mathcal{M}}(i)$ and is defined to be equal to the sum of the values of column $i$ of $\mathcal{M}$, i.e., $\mathrm{csum}_{\mathcal{M}}(i) = \sum_{t} \mathcal{M}(t, i)$. Similarly, the row-sum of row $t$ in $\mathcal{M}$ is denoted by $\mathrm{rsum}_{\mathcal{M}}(t)$ and is defined to be equal to the sum of the values of row $t$ of $\mathcal{M}$, i.e., $\mathrm{rsum}_{\mathcal{M}}(t) = \sum_{i} \mathcal{M}(t, i)$.
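The projected matrix and the column- and row-sums can be sketched in a few lines of Python. The matrix below is the Figure 1(a) database restricted to its frequent columns, stored as a dictionary of non-zero entries; this storage layout and the names `project`, `csum`, and `rsum` are assumptions made for illustration only.

```python
ORDER = ["g", "a", "c", "e", "d"]  # the pre-defined column ordering

# Binary transaction-item matrix; only the non-zero entries are stored.
M = {
    1: {"g": 1, "c": 1, "e": 1, "d": 1},
    2: {"a": 1, "c": 1, "e": 1, "d": 1},
    3: {"g": 1, "a": 1, "e": 1, "d": 1},
    4: {"a": 1, "c": 1, "d": 1},
}

def project(matrix, p, order=ORDER):
    """p-projected matrix: rows supporting all columns of p, columns after p."""
    last = max((order.index(c) for c in p), default=-1)
    later = set(order[last + 1:])
    return {
        r: {c: v for c, v in row.items() if c in later}
        for r, row in matrix.items()
        if all(c in row for c in p)
    }

def csum(matrix, col):
    """Column-sum: sum of the values stored in column `col`."""
    return sum(row.get(col, 0) for row in matrix.values())

def rsum(matrix, row_id):
    """Row-sum; for a binary matrix this equals the transaction size |t|."""
    return sum(matrix[row_id].values())
```

Projecting on the prefix `["a"]` keeps transactions 2, 3, and 4 and the columns c, e, and d, so `csum(project(M, ["a"]), "d")` is 3.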

3.  PROBLEM  DEFINITION 

In this paper we develop efficient algorithms for finding valid closed blocks that satisfy certain tough block constraints. Specifically, we focus on three types of block constraints that are motivated and described in this section.
Block Size Constraint In the context of market-basket analysis we are often interested in finding the itemsets each of which accounts for a certain fraction of the overall number of transactions that were performed during a certain period of time. Given an itemset $I$ and its supporting set $T$, the extent to which $I$ will satisfy this constraint will depend on whether or not $|I| \times |T|$ is no less than the specified fraction. Finding this type of itemset is the motivation behind the first block constraint that we study, which focuses on finding all blocks $B = (I, T)$ whose size is no less than a certain threshold. Specifically, given a binary transaction database $\mathcal{T}$, the block-size constraint is defined as

$\mathrm{BSize}(B) \ge \theta N$,  (1)

where $0 < \theta < 1$ and $N$ is the total number of non-zeros in the transaction-item matrix of $\mathcal{T}$, i.e., $N = \sum_{t \in \mathcal{T}} |t|$.

Note that depending on the size of the itemsets associated with each valid block, the minimum required size of the corresponding transaction set will be different. Small itemsets will require larger transaction sets, whereas large itemsets will lead to valid blocks with smaller transaction sets. As a result, even if an itemset $I$ is not part of a valid block, an extension of $I$, $I'$, may become valid (e.g., cases in which the support of $I'$ does not significantly decrease compared to the support of $I$). Similarly, an itemset $I$ which is not part of any valid block may contain subsets that are part of some valid blocks (e.g., cases in which the support of the subset is significantly greater than the support of $I$). Consequently, the block-size constraint is a tough constraint as it is neither anti-monotone nor monotone, and cannot be converted to either anti-monotone or monotone constraints.
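The following Python sketch checks this behaviour on the example database of Figure 1(a). The threshold value 0.25 and the chain of itemsets are chosen for illustration; the sketch only demonstrates that the constraint is neither anti-monotone nor monotone and does not address convertibility.

```python
# Example database of Figure 1(a): transaction id -> set of items.
DB = {
    1: {"c", "d", "e", "f", "g", "i"},
    2: {"a", "c", "d", "e", "m"},
    3: {"a", "b", "d", "e", "g", "k"},
    4: {"a", "c", "d", "h"},
}

# N: total number of non-zeros in the transaction-item matrix.
N = sum(len(t) for t in DB.values())

def bsize(itemset):
    """|I| x |T|, where T is the supporting set of the itemset."""
    items = set(itemset)
    return len(items) * sum(1 for t in DB.values() if items <= t)

def satisfies_block_size(itemset, theta):
    return bsize(itemset) >= theta * N

# A chain of nested itemsets: {d} < {a,d} < {a,c,d,e}.
CHAIN = [{"d"}, {"a", "d"}, {"a", "c", "d", "e"}]
VERDICTS = [satisfies_block_size(x, 0.25) for x in CHAIN]
```

Here N = 21 and the threshold is 5.25. The block sizes along the chain are 4, 6, and 4, so the middle itemset is valid while both its subset and its superset are not.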


Block Sum Constraint In cases in which there is a non-negative weight associated with each individual transaction-item pair (e.g., sales or profit achieved by selling an item to a customer), in addition to finding all itemsets that satisfy a certain block-size constraint we may also be interested in finding the itemsets whose corresponding weighted blocks have a block-sum that is greater than a certain threshold. For example, in the context of market-basket analysis, these itemsets can be used to identify the product groups that account for a certain fraction of the overall sales, profits, etc. Motivated by this, the second block constraint that we study extends the notion of the block-size constraint to weighted blocks. Formally, given a transaction database $\mathcal{T}$ and a weight function $w$, the block-sum constraint is defined as

$\mathrm{BSum}(B) \ge \theta W$,  (2)

where $0 < \theta < 1$ and $W$ is the sum of the weights of all the transaction-item pairs in the database, i.e., $W = \sum_{t \in \mathcal{T}} \sum_{i \in t} w(t, i)$. Note that since the block-sum constraint is a generalization of the block-size constraint, it is also a tough constraint.

Block Similarity Constraint The last block constraint that we study is motivated by the problem of finding groups of thematically related words in large document datasets, each potentially describing a different micro-concept present in the collection. One way of finding such groups is to analyze the document-term matrix associated with the dataset and find sets of words that satisfy either a user-specified minimum support constraint or a block-size constraint (as defined earlier). However, the limitation of these approaches is that they do not account for the weights that are often associated with the various words as a result of the widely used tf-idf (term frequency-inverse document frequency) vector-space model [24]. In general, groups of words that have higher weights will more likely represent a thematically coherent concept than words that have very low weights, even if the latter groups have higher support. This often happens with words that are common in almost all the documents and are therefore assigned very low weight due to their high document frequency [3].

One way of addressing this problem is to first apply the tf-idf model on each document vector, scale the resulting document vectors to be of the same length (e.g., unit length), and then find the groups of related words by using the previously defined block-sum constraint. However, within the context of the vector-space model, a more natural way of measuring the importance of a group of words is to look at how much they contribute to the overall similarity between the documents in the collection. In other words, the micro-concept discovery problem can be formulated as that of finding all groups of words such that the removal of each group from their supporting documents will decrease the aggregate similarity between the documents by a certain fraction. In general, groups of words that are large, supported by many documents, and have high weights will tend to contribute a higher fraction to the aggregate similarity and hence form better micro-concepts.

The need to discover groups of words that satisfy the above property led us to develop the block-similarity constraint, which is defined as follows. Let $A = \{d_1, d_2, \ldots, d_n\}$ be a set of $n$ documents modeled by their unit-length tf-idf representation, let $m$ be the number of distinct terms in $A$, let $B = (I, T)$ be a weighted block with $I$ being a set of words and $T$ being its supporting set of documents, let $S$ be the sum of the pairwise similarities between the documents in $A$, and let $S'$ be the sum of the pairwise similarities between the documents in $A$ obtained after zeroing-out the entries corresponding to block $B$. The similarity of the block $B$ is defined to be the loss in the aggregate pairwise similarity resulting from removing $B$, i.e., $\mathrm{BSim}(B) = S - S'$, and the block-similarity constraint is defined as

$\mathrm{BSim}(B) \ge \theta S$,  (3)

where $0 < \theta < 1$.

In this paper, we will measure the similarity between two documents $d_i$ and $d_j$ in $A$ by computing the dot-product of their corresponding vectors $d_i$ and $d_j$ (i.e., $\mathrm{sim}(d_i, d_j) = d_i \cdot d_j$). Since the documents in $A$ have already been scaled to be of unit length, this similarity measure is nothing more than the cosine of their respective vectors, which is used widely in information retrieval. The advantage of the dot-product-based similarity measure is that it allows us to easily and efficiently compute both $S$ and $S'$. Specifically, if $D$ is the composite vector of $A$, it can be shown that $S = D \cdot D$. Similarly, if $B = (I, T)$ is a weighted block of $A$, and $B_I$ is its corresponding composite vector, it can be shown that $S' = (D - B_I) \cdot (D - B_I)$. As a result, the similarity of a block $B = (I, T)$ is given by

$\mathrm{BSim}(B) = S - S' = 2 D \cdot B_I - B_I \cdot B_I$.  (4)
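Equation (4) can be checked numerically with a short Python sketch that compares a brute-force computation of $S - S'$ against the composite-vector formula. The three documents and their term weights are invented for the example, and the sketch assumes that the aggregate similarity counts all ordered pairs $(i, j)$, including $i = j$, which is what $S = D \cdot D$ implies.

```python
import math

def normalize(doc):
    """Scale a sparse term-weight vector to unit length."""
    norm = math.sqrt(sum(v * v for v in doc.values()))
    return {k: v / norm for k, v in doc.items()}

def dot(u, v):
    return sum(w * v.get(k, 0.0) for k, w in u.items())

def composite(docs):
    """Composite vector: component-wise sum of the given vectors."""
    total = {}
    for d in docs:
        for k, w in d.items():
            total[k] = total.get(k, 0.0) + w
    return total

def aggregate_similarity(docs):
    """Brute-force S: sum of d_i . d_j over all ordered pairs (i, j)."""
    return sum(dot(a, b) for a in docs for b in docs)

def zero_out(docs, items):
    """Remove the block's terms from every document that supports all of them."""
    items = set(items)
    return [
        {k: w for k, w in d.items() if k not in items} if items <= set(d) else d
        for d in docs
    ]

def bsim_direct(docs, items):
    return aggregate_similarity(docs) - aggregate_similarity(zero_out(docs, items))

def bsim_composite(docs, items):
    """Equation (4): 2 D . B_I - B_I . B_I."""
    items = set(items)
    D = composite(docs)
    B = composite([{k: d[k] for k in items} for d in docs if items <= set(d)])
    return 2 * dot(D, B) - dot(B, B)

# Hypothetical documents with made-up term weights, scaled to unit length.
DOCS = [normalize(d) for d in (
    {"data": 3, "mining": 2, "graph": 1},
    {"data": 1, "mining": 4},
    {"graph": 2, "tree": 2},
)]
```

The composite-vector form needs only two dot products per block, whereas the brute-force form is quadratic in the number of documents.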

To simplify the presentation of the three block constraints and the associated algorithms, in the rest of this paper we will consider the set of documents $A$ as forming a weighted transaction-item matrix $\mathcal{M}$ whose rows and columns correspond to the documents and terms of $A$, respectively. As a result, each matrix entry $\mathcal{M}(i, j)$ will be equal to $d_i(j)$ (i.e., the value in $d_i$'s vector along the $j$th dimension).

4.  RELATED  RESEARCH 

Efficient algorithms for finding frequent itemsets in large databases have been one of the key success stories in data mining research [2, 17, 6, 4, 10, 33]. One of the early computationally efficient algorithms was Apriori [2], which finds frequent itemsets of length $l$ based on the previously mined frequent itemsets of length $(l-1)$. More recently, a set of database-projection-based methods [1, 10, 22] have been developed that significantly reduce the complexity of finding long frequent patterns. This study extends the projection-based method to mine valid sub-matrices with tough block constraints.

Frequent itemset mining algorithms usually generate a large number of frequent itemsets when the support is low. To solve this problem, two general classes of techniques have been proposed. The first is mining closed/maximal patterns. Typical examples include Max-Miner [4], A-close [18], MAFIA [8], CHARM [33], CFP-tree [15], and CLOSET+ [28]. The redundant pattern pruning and column fusing methods adopted by CBMiner have been widely used in different forms by several previous studies [4, 32, 23, 8, 33, 28, 15]. The second class focuses on mining constrained patterns by integrating various anti-monotone, monotone, or convertible constraints. The constrained association rule mining problem was first considered in [27] but only for item-specific constraints. Since then a number of different constrained frequent pattern mining algorithms have been proposed [5, 16, 19, 21, 7, 20, 13, 12]. All these algorithms concentrate on constrained itemset mining with various anti-monotone, monotone, succinct, or convertible constraints.

Very recently, some work [30] has been done to push aggregate constraints in the context of iceberg-cube computing. This algorithm mines aggregate constraints in the GROUP BY partitions of an SQL query by using a divide-and-approximate strategy. The algorithm uses this strategy to derive a sequence of weaker anti-monotone constraints for a given non-anti-monotone constraint in order to prune the nodes in the search tree. Recently the LPMiner algorithm [25] was proposed to mine itemsets with length-decreasing support constraints. It uses a novel SVE property to prune the unpromising transactions of the projected databases based on the length of the transactions. Later, the SVE property was used to mine sequences and closed itemsets with length-decreasing support constraints [26, 31]. We also explore the SVE property in the context of mining closed patterns with block constraints in Section 6.2 to prune the unpromising rows of a prefix-projected matrix.

5.  MATRIX-PROJECTION  BASED  PATTERN 
MINING 

In this section we describe the ClsdPtrnMiner algorithm, which forms the basis of the CBMiner algorithm. ClsdPtrnMiner follows the widely used projection-based pattern mining paradigm [1, 10, 22], which can be used to efficiently mine the complete set of frequent patterns in a depth-first search order and, as we will see later, can be easily adapted to mine valid closed block patterns. A key characteristic of ClsdPtrnMiner (as well as CBMiner) is that it represents the transaction database $\mathcal{T}$ using the transaction-item matrix $\mathcal{M}$ and employs a number of efficient sparse matrix storage and access schemes, allowing it to achieve high computational efficiency. In the remainder of this section we describe the basic structure of ClsdPtrnMiner for the problem of enumerating all patterns satisfying a constant minimum support constraint and then introduce several pruning methods to accelerate frequent closed pattern mining. The extension of this algorithm for finding the closed blocks that satisfy the three tough block constraints described in Section 3 will be described later in Section 6.

5.1  Frequent  Pattern  Enumeration 

Given  a  database,  the  complete  set  of  itemsets  can  be 
organized  into  a  lattice  if  the  items  are  in  a  predefined  order, 
and  the  problem  of  frequent  pattern  mining  then  becomes 
how  to  traverse  the  lattice  to  find  the  frequent  ones.  The 
ClsdPtrnMiner  algorithm  adopts  the  depth-first  search 
traversal  and  uses  the  downward  closure  property  to  prune 
the  infrequent  columns  from  further  mining.  Figure  1(a) 
shows an example database with a minimum support of 0.5.

If we remove the set of infrequent columns, {b,f,h,i,k,m}, and sort the set of frequent columns in frequency-increasing order, then part of the lattice (i.e., the pattern tree) formed from column set {g,a,c,e,d} can be organized into the one shown in Figure 1(b). Each node in the lattice is labeled in the form p:q, where p is a prefix itemset and q is the set of local columns appearing in the p-projected matrix, $\mathcal{M}|_p$. At a certain node during the depth-first traversal of the lattice, if the corresponding prefix p is infrequent, we stop mining the sub-tree under this node. Otherwise, we report p as a frequent pattern, build its projected matrix, $\mathcal{M}|_p$, find its locally frequent columns in $\mathcal{M}|_p$, and use them to grow p to get longer itemsets.

TID   Items
1     c, d, e, f, g, i
2     a, c, d, e, m
3     a, b, d, e, g, k
4     a, c, d, h

(a)

{}:{g,a,c,e,d}
ga:{c,e,d}  gc:{e,d}  ge:{d}  gd  ac:{e,d}  ae:{d}  ad  ce:{d}  cd  ed
gac:{e,d}  gae:{d}  gad  gce:{d}  gcd  ged  ace:{d}  acd  aed  ced
gace:{d}  gacd  gaed  gced
gaced

(b)

Figure 1: (a) A transaction database with minimum support 0.5; (b) The pattern tree.
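The depth-first, projection-based enumeration described above can be sketched in Python as follows. This is a simplified illustration on the Figure 1(a) database, not the ClsdPtrnMiner implementation: it stores projected matrices as plain lists of rows, keeps the global column order at every level, and uses the hypothetical function name `frequent_patterns`.

```python
# Example database of Figure 1(a): transaction id -> set of items.
DB = {
    1: {"c", "d", "e", "f", "g", "i"},
    2: {"a", "c", "d", "e", "m"},
    3: {"a", "b", "d", "e", "g", "k"},
    4: {"a", "c", "d", "h"},
}

def frequent_patterns(db, min_support):
    min_count = min_support * len(db)
    counts = {}
    for items in db.values():
        for i in items:
            counts[i] = counts.get(i, 0) + 1
    # Drop infrequent columns and sort the rest in frequency-increasing order
    # (ties broken alphabetically, an assumption of this sketch).
    order = sorted((i for i in counts if counts[i] >= min_count),
                   key=lambda i: (counts[i], i))
    rank = {c: k for k, c in enumerate(order)}
    rows = [sorted((i for i in items if i in rank), key=rank.get)
            for items in db.values()]
    found = {}

    def mine(prefix, projected):
        # Count the local columns of the prefix-projected matrix.
        local = {}
        for row in projected:
            for c in row:
                local[c] = local.get(c, 0) + 1
        for c in order:
            if local.get(c, 0) < min_count:
                continue  # infrequent extension: its whole sub-tree is skipped
            pattern = prefix + (c,)
            found[pattern] = local[c]
            # Project: keep rows containing c, and only the columns after c.
            mine(pattern, [row[row.index(c) + 1:] for row in projected if c in row])

    mine((), rows)
    return order, found

ORDER, FOUND = frequent_patterns(DB, 0.5)
```

On this database the column order comes out as g, a, c, e, d, and 17 frequent itemsets are reported, for example `("g", "e", "d")` with frequency 2.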

To store the various projected matrices efficiently, we adopt the CSR sparse storage scheme [9]. The CSR format utilizes two one-dimensional arrays: the first stores the actual non-zero elements of the matrix in row- (or column-) major order, and the second stores the indices corresponding to the beginning of each row (or column). To ensure that both the matrix projection and the column frequency counting are performed efficiently, we maintain both the row- and the column-based representation of the matrix. The overall complexity of the algorithm depends on the two key steps of sorting and projecting. We use the radix sort algorithm to sort the column frequencies, which has a time complexity that is linear in the number of columns being sorted, and because of our matrix-storage scheme, projecting the matrix on a column is linear in the number of non-zeros in the projected matrix. Our matrix-projection-based pattern enumeration method shares some ideas with the recently developed array-projection-based method [22], which was shown to achieve good performance, especially for sparse datasets.
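A minimal Python sketch of this two-array layout for a binary matrix is given below, where the "non-zero elements" are simply the column indices of each row. The helper names (`to_csr`, `csr_row`, `transpose`, `project_on_column`) are hypothetical, and the data is the Figure 1(a) database with the frequent columns g, a, c, e, d numbered 0 to 4.

```python
def to_csr(rows):
    """rows: list of sorted index lists. Returns (indices, ptr)."""
    indices, ptr = [], [0]
    for row in rows:
        indices.extend(row)
        ptr.append(len(indices))  # start of the next row
    return indices, ptr

def csr_row(csr, r):
    """Slice out the entries of row r."""
    indices, ptr = csr
    return indices[ptr[r]:ptr[r + 1]]

def transpose(csr, n_cols):
    """Column-based copy: for each column, the rows that contain it."""
    indices, ptr = csr
    cols = [[] for _ in range(n_cols)]
    for r in range(len(ptr) - 1):
        for c in indices[ptr[r]:ptr[r + 1]]:
            cols[c].append(r)
    return to_csr(cols)

def project_on_column(csr, csc, c):
    """Rows that contain column c, restricted to the columns after c."""
    return to_csr([[j for j in csr_row(csr, r) if j > c] for r in csr_row(csc, c)])

# Columns g, a, c, e, d are numbered 0..4; one list per transaction.
CSR = to_csr([[0, 2, 3, 4], [1, 2, 3, 4], [0, 1, 3, 4], [1, 2, 4]])
CSC = transpose(CSR, 5)
```

The column-based copy gives each column's frequency directly as the length of its slice, and the projection touches only the rows listed for the chosen column.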

5.2  Frequent  Closed  Pattern  Mining 

The above frequent pattern enumeration method can find
the complete set of frequent itemsets. To get the set of
frequent closed itemsets, we need to check whether a newly
found itemset is closed and sift out the redundant
(i.e., non-closed) ones. The pattern closure checking in
ClsdPtrnMiner works as follows. We maintain the set of
frequent closed itemsets mined so far in a hash-table H, using
the sum of the transaction-IDs of the supporting
transactions as the hash-key [32, 33]. Upon getting a new itemset
p, we compare it against the already mined closed itemsets
that have the same hash-key value as the one derived from
p's sum of transaction-IDs, to see if any of them
is a proper superset of p with the same support. If that is
the case, p is non-closed; otherwise, the union of p and the
set of its local columns with the same support as p forms a
closed itemset.
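A minimal sketch of this closure check is shown below. The class and method names are hypothetical, and a Python dictionary stands in for the hash-table H; only the idea of keying on the sum of transaction-IDs is taken from the text.

```python
from collections import defaultdict

class ClosedItemsetTable:
    """Closed itemsets found so far, bucketed by the sum of the
    transaction-IDs of their supporting transactions."""

    def __init__(self):
        self.buckets = defaultdict(list)   # TID sum -> [(itemset, support)]

    def insert(self, items, tids):
        self.buckets[sum(tids)].append((frozenset(items), len(tids)))

    def is_non_closed(self, items, tids):
        """True if a stored itemset is a proper superset of `items`
        with the same support."""
        items = frozenset(items)
        for other, support in self.buckets.get(sum(tids), ()):
            if support == len(tids) and items < other:
                return True
        return False
```

Two different transaction sets can share the same TID sum, so the key only narrows the candidates. The superset and support comparison is what decides the outcome: a proper superset can only be supported by a subset of the same transactions, so equal support implies an identical supporting set.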

In the pattern enumeration process, some prefix itemsets
or columns cannot lead to closed itemsets and thus can be
pruned. ClsdPtrnMiner adopts two pruning methods,
redundant pattern pruning and column fusing [4,
32, 23, 8, 33, 28]. Also, to make the matrix representation
more compact, we propose the row fusing method to
compress the projected matrix.

1. Redundant Pattern Pruning (RPP) Once we
find that a prefix itemset is non-closed, that is, it is a
proper subset of another already mined closed itemset
with the same support, it can be safely pruned, and the
sub-tree under the node corresponding to this prefix
will not be traversed.

2. Column Fusing (CF) This optimization performs
two different tasks. First, it fuses the completely dense
columns of the projected matrix M|p into the prefix
itemset p and removes them from M|p, thus
avoiding projections on them. Second, it fuses columns in
M|p that have identical supporting transaction sets
into a single column, and removes the original columns
from M|p. By fusing them, the algorithm reduces the
number of projections that need to be performed, as
it essentially allows the pattern to grow by adding
multiple columns in a single step.

3. Row Fusing (RF) The motivation behind this
optimization is to compress the projected matrix M|p by
fusing together the identical rows of the matrix into a
single row. To ensure that the frequencies of the
various patterns are computed correctly, each row of the
matrix has a count associated with it indicating the
number of rows in the original matrix that it
corresponds to.
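The two fusing optimizations can be illustrated with a small sketch. It assumes a simplified representation that is not taken from the paper: the projected matrix is a list of `(count, frozenset_of_columns)` pairs, and a fused column is represented by the sorted tuple of its constituent columns. The function names are hypothetical.

```python
from collections import defaultdict

def column_fusing(prefix, rows):
    """Move completely dense columns into the prefix and merge columns
    with identical supporting row sets into a single fused column."""
    support = defaultdict(set)               # column -> ids of rows containing it
    for r, (_, cols) in enumerate(rows):
        for c in cols:
            support[c].add(r)
    dense = {c for c, rs in support.items() if len(rs) == len(rows)}
    groups = defaultdict(list)               # supporting row set -> columns
    for c, rs in support.items():
        if c not in dense:
            groups[frozenset(rs)].append(c)
    new_rows = []
    for r, (count, _) in enumerate(rows):
        fused = frozenset(tuple(sorted(g)) for rs, g in groups.items() if r in rs)
        new_rows.append((count, fused))
    return set(prefix) | dense, new_rows

def row_fusing(rows):
    """Merge identical rows and add up their counts."""
    merged = defaultdict(int)
    for count, cols in rows:
        merged[cols] += count
    return [(count, cols) for cols, count in merged.items()]
```

For example, with prefix {a} and rows {u,v,d}, {u,v,d}, {w,d}, column d is dense and joins the prefix, u and v collapse into one fused column, and the first two rows then merge into a single row with count 2.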

By integrating the above optimization methods with the
frequent pattern enumeration process, we get the
ClsdPtrnMiner algorithm shown in Algorithm 5.1. It takes
as input the current pattern p, the p-projected matrix M|p,
the given minimum support θ, and the current hash-table H.
The algorithm initially sorts the columns of M|p and
eliminates any infrequent columns, and then proceeds to perform
Column Fusing and Row Fusing. After that it enters its
main computational loop, which extends p by adding each
column a ∈ M|p, checks to see if p ∪ {a} can be pruned
by comparing it against H (Redundant Pattern Pruning),
projects M|p on a, checks to see if p ∪ {a} is closed, and
finally calls itself recursively for pattern p ∪ {a}.


Algorithm 5.1: ClsdPtrnMiner(p, M|p, θ, H)

  Sort the columns of M|p in frequency-increasing order
  Prune the columns in M|p whose support is less than θ
  if no column is frequent
    then return
  Do Column Fusing for the columns in M|p
  Do Row Fusing for the rows in M|p
  for each column a ∈ M|p
    do if p ∪ {a} is a Redundant Pattern
         then continue
       Project M|p on a to get M|p∪{a}
       if there is no dense column in M|p∪{a}
         then Output the closed pattern p ∪ {a}
              Insert p ∪ {a} into the hash-table H
       ClsdPtrnMiner(p ∪ {a}, M|p∪{a}, θ, H)
  return
6.  CLOSED  BLOCK  MINING  WITH  TOUGH
CONSTRAINTS

Like the traditional frequent closed pattern mining
algorithms, ClsdPtrnMiner works under the constant
support threshold framework and uses the downward closure
property to prune infrequent columns. However, with tough
block constraints, the nice properties derived from
anti-monotone (or monotone) constraints no longer hold and
cannot be used to prune the search space, which makes
designing effective pruning methods for tough block
constraints especially challenging. To address this challenge
we developed three classes of pruning methods, called
column pruning, row pruning, and matrix pruning, which
eliminate the unpromising columns, rows, and projected
matrices from mining. The specific details of these pruning
methods are different for each of the three block
constraints and will be described later in this section.

By incorporating these three pruning methods into the
overall structure of ClsdPtrnMiner, we derive
the CBMiner algorithm, which efficiently mines the set of all
valid closed block patterns. The pseudo code for CBMiner
is shown in Algorithm 6.1. It takes as input the current
pattern p, its corresponding p-projected matrix M|p, the
hash-table H that stores the valid closed blocks that have been
discovered so far, and the block constraint C, which corresponds
to either the block-size, block-sum, or block-similarity
constraint. Since it is derived from the ClsdPtrnMiner
algorithm, it has many steps in common with it, and for this
reason we will only describe its key differences.


Algorithm 6.1: CBMiner(p, M|p, H, C)

  Sort the columns of M|p in frequency-increasing order
  if matrix M|p can be pruned
    then return
  Prune the columns in M|p
  if no column is valid
    then return
  Do Column Fusing for the columns in M|p
  Do Row Fusing for the rows in M|p
  for each column a ∈ M|p
    do if p ∪ {a} is a Redundant Pattern
         then continue
       let B = (p ∪ {a}, T|p∪{a})
       Project M|p on a to get M|p∪{a}
       if there is no dense column in M|p∪{a} and C(B) = TRUE
         then Output the closed block B
              Insert B into the hash-table H
       Prune the rows of M|p∪{a}
       CBMiner(p ∪ {a}, M|p∪{a}, H, C)
  return


The first difference has to do with the pruning methods.
Specifically, instead of using the constant support-based
column pruning, CBMiner uses the newly proposed column
pruning, row pruning, and matrix pruning methods, which
are derived from the tough block constraints. The second
difference has to do with the implementation of the column
fusing and row fusing optimizations for the block-sum and
block-similarity constraints. In the case of the block-sum
constraint, the values of the fused columns (and rows)
correspond to the sum of the values of their constituent columns
(and rows). This ensures that the resulting fused matrix
contains all the information necessary to correctly evaluate the
constraint. In the case of the block-similarity constraint,
since the correct evaluation of the constraint requires
access to the individual column and row values, we do not
perform any column or row fusing.

In the following, we describe in detail the three pruning
methods, column pruning, row pruning, and matrix pruning,
for each of the three block constraints.

6.1  Column  Pruning 

Given a prefix itemset p and its projected matrix M|p, the
idea behind column pruning is to identify for each column
x ∈ M|p a necessary condition that must be satisfied in order
for there to be a valid block B = (p ∪ γ, T|p∪γ) for which γ
is a subset of the columns in M|p and x ∈ γ. Using this
condition, we can then eliminate from M|p all the columns
that do not satisfy it, as these columns cannot be part of
a valid block that contains p. Note that for each column x
that we eliminate, we prevent the exploration of the
sub-tree associated with the pattern p ∪ {x}, thus significantly
reducing the overall search space.

6.1.1  Block  Size 

The necessary condition for the block-size constraint is
encapsulated in the following lemma (refer to Section 2 for
a description of the notation used).

Lemma 1 (Block-Size Column Pruning) Let p be a
pattern and x a column in M|p. Then in order for x to be
part of a valid block that satisfies the block-size constraint
of Equation 1 and is obtained by extending p with
columns from M|p, the following must hold:

$$\mathrm{BSize}(p, T|_p) + \sum_{t \in T|_x} |t| \;\ge\; \theta N. \tag{5}$$

Proof. Let γ be any arbitrary set of columns in M|p
such that x ∈ γ and the block B = (p ∪ γ, T|p∪γ) satisfies
the block-size constraint. Then from the definition of the
block-size constraint we have that

$$|p \cup \gamma| \times |T|_{p \cup \gamma}| \;\ge\; \theta \times N.$$

Also, because T|x ⊇ T|p∪γ and |t| ≥ |γ| for every t ∈ T|p∪γ, the
following holds:

$$\sum_{t \in T|_x} (|p| + |t|) \;\ge\; \sum_{t \in T|_{p \cup \gamma}} (|p| + |t|) \;\ge\; |p \cup \gamma| \times |T|_{p \cup \gamma}|.$$

Equation 5 can be obtained by combining the above two
inequalities and using the facts that |T|x| ≤ |T|p| and
BSize(p, T|p) = |p| × |T|p|. □

For each column in M|p, Equation 5 can be evaluated by
adding up the lengths of the rows that support it. These
sums can be computed for all the columns by performing a
single scan of the p-projected matrix.
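The single-scan evaluation of Equation 5 can be sketched as follows. The representation is an assumption made for illustration: each row of M|p is a set of local column ids, `p_len` is |p|, and `theta_n` is the already multiplied threshold θN. The function name is hypothetical.

```python
def block_size_column_pruning(p_len, rows, theta_n):
    """Return the columns x of M|p that satisfy
    BSize(p, T|p) + sum of |t| over the rows t containing x >= theta * N."""
    bsize_p = p_len * len(rows)              # BSize(p, T|p) = |p| x |T|p|
    length_sum = {}                          # column -> sum of supporting row lengths
    for cols in rows:                        # one scan of the projected matrix
        for x in cols:
            length_sum[x] = length_sum.get(x, 0) + len(cols)
    return {x for x, s in length_sum.items() if bsize_p + s >= theta_n}
```

For the prefix p = {g} of Figure 1, the projected rows are {c, e, d} and {a, e, d}. With a hypothetical threshold θN = 8, columns e and d survive (2 + 6 ≥ 8) while a and c are pruned (2 + 3 < 8).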

6.1.2  Block  Sum 

The necessary condition for the block-sum constraint is
similar in nature to that of the block-size constraint and is
encapsulated in the following lemma.

Lemma 2 (Block-Sum Column Pruning) Let p be a
pattern and x a column of M|p. Then in order for x to be
part of a valid block that satisfies the block-sum constraint
of Equation 2 and is obtained by extending p with columns
in M|p, the following must hold:

$$\mathrm{BSum}(p, T|_p) + \sum_{t \in T|_x,\; j \in I|_p} M|_p(t, j) \;\ge\; \theta W. \tag{6}$$

This lemma can be proved in a similar way to Lemma 1
and the actual proof is omitted. Note that the summation
on the left-hand side of Equation 6 is nothing more than the
sum of the non-zero elements of each row in T|x.

The various quantities required to evaluate Equation 6
can be computed efficiently by performing a single scan of
the block (p, T|p) to compute the sum of each row, and two
scans of the matrix M|p. The first scan computes the
sum of the non-zero elements of each row, and the second
scan computes the summation term in Equation 6 for
each column.
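A minimal sketch of the two scans over M|p is given below. It assumes that each row is a dictionary mapping local column ids to non-negative weights, that BSum(p, T|p) has already been computed and is passed in as `bsum_p`, and that `theta_w` is the product θW. The function name is hypothetical.

```python
def block_sum_column_pruning(bsum_p, rows, theta_w):
    """Return the columns x of M|p for which BSum(p, T|p) plus the
    summation term of Equation 6 reaches theta * W."""
    row_sum = [sum(r.values()) for r in rows]      # first scan: row sums
    col_term = {}                                  # column -> summation term
    for r, s in zip(rows, row_sum):                # second scan
        for x in r:
            col_term[x] = col_term.get(x, 0.0) + s
    return {x for x, s in col_term.items() if bsum_p + s >= theta_w}
```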


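follows:

$$
\begin{aligned}
2D \cdot B_x &= 2 \sum_{i} D(i) \times \mathrm{csum}_{M|_x}(i) \\
&\le 2\xi \sum_{i \in I|_x} \mathrm{csum}_{M|_x}(i) \\
&= 2\xi \sum_{t \in T|_x} \mathrm{rsum}_{M|_x}(t) \\
&\le 2\xi\zeta\,\mathrm{freq}(x) \;\le\; 2\xi r\,\mathrm{freq}(x).
\end{aligned}
$$

Combining this inequality with Equation 7 proves the lemma. □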

In a single scan of the projected matrix, we can compute
the frequency of all its columns along with the value of r.
Hence the complexity is of the order of the size of the
projected matrix.
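The combination step can be spelled out as a short worked derivation. It is a sketch that assumes ξ denotes the maximum entry of D and r the maximum row-sum of M|p, as in the statement of Lemma 4, and that ξr > 0 so that the division is valid. The second inequality below is Equation 7 rearranged.

```latex
2\xi r\,\mathrm{freq}(x) \;\ge\; 2D \cdot B_x \;\ge\; \theta S - 2D \cdot B_p
\quad\Longrightarrow\quad
\mathrm{freq}(x) \;\ge\; \frac{\theta S - 2D \cdot B_p}{2\xi r}.
```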


6.1.3  Block  Similarity 

Let D be the composite vector of T and consider a
p-projected weighted matrix M|p. The necessary condition
for the block-similarity constraint is encapsulated in the
following lemma.

Lemma 3 (Block-Similarity Column Pruning) Let p be
a pattern, (p, T|p) its corresponding block, and x a column of
M|p. Then in order for x to be part of a block that satisfies
the block-similarity constraint of Equation 3 and is obtained
by extending p with columns in M|p, the following must hold:

$$2D \cdot (B_x + B_p) \;\ge\; \theta S. \tag{7}$$

Proof. Let γ be any arbitrary set of columns in M|p
such that x ∈ γ and the block B = (p ∪ γ, T|p∪γ) satisfies
the block-similarity constraint, and let B_{p∪γ} be its
corresponding composite vector. Then from the definition of the
block-similarity constraint we have that

$$2D \cdot B_{p \cup \gamma} - B_{p \cup \gamma} \cdot B_{p \cup \gamma} \;\ge\; \theta S.$$

Also, because T|p ⊇ T|x ⊇ T|p∪γ, the following holds:

$$2D \cdot (B_p + B_x) \;\ge\; 2D \cdot B_{p \cup \gamma} \;\ge\; 2D \cdot B_{p \cup \gamma} - B_{p \cup \gamma} \cdot B_{p \cup \gamma}.$$

Hence the above two inequalities prove the lemma. □

For each column of M|p, evaluating the above equation
incurs a computational cost equivalent to one scan of the
p-projected matrix, which is very costly. So, we make use of
the following lemma, which approximates Equation 7.


6.2  Row  Pruning 

Given a pattern p and its projected matrix M|p, the idea
behind row pruning is to identify for each row t ∈ M|p a
necessary condition that must be satisfied in order for there
to be a valid block B = (p ∪ γ, T|p∪γ) for which γ ⊆ t. Using
this condition, we can then eliminate from M|p all the rows
that do not satisfy it, as these rows cannot be part of a valid
block that contains p. By eliminating such rows we reduce the
size of M|p and thus reduce the amount of time required to
perform subsequent projections and enhance future column
pruning operations.

To derive such conditions we make use of the Smallest
Valid Extension (SVE) principle, originally introduced in
[25] for finding itemsets with length-decreasing support
constraint. In the context of the block constraints considered in this
paper, the smallest valid extension of a prefix p is defined as
the length of the smallest possible extension γ to p (where
γ is a set of columns in M|p), such that the resulting block
B = (p ∪ γ, T|p∪γ) is valid for a given constraint C. That is,

$$\mathrm{SVE}(p) = \min_{\gamma \subseteq I|_p} \{\, |\gamma| \;:\; C(p \cup \gamma,\, T|_{p \cup \gamma}) = \mathrm{TRUE} \,\}.$$

Knowing  the  SVE  of  a  pattern,  we  can  then  eliminate  all 
the  rows  whose  length  is  smaller  than  the  SVE  value.  Note 
that  the  SVE  of  a  pattern  that  already  corresponds  to  a 
valid  block  will  be  by  definition  zero.  For  this  reason,  the 
row-pruning  is  only  applied  when  the  pattern  p  does  not 
correspond  to  a  valid  block. 

In the rest of this section we describe how to obtain such
SVE-based necessary conditions for the block-size,
block-sum, and block-similarity constraints.


Lemma 4 (Approximate Block-Similarity Column Pruning)
Let ξ be the maximum value across the m dimensions
of vector D and r be the maximum row-sum over all the
rows of the p-projected matrix M|p. Then in order for x to
be part of a block that satisfies the block-similarity constraint
of Equation 3 and is obtained by extending p with columns
in M|p, the following must hold:

$$\mathrm{freq}(x) \;\ge\; \frac{\theta S - 2D \cdot B_p}{2\xi r}. \tag{8}$$

6.2.1  Block  Size 

The SVE of a pattern p for the block-size constraint is
given by the following lemma.

Lemma 5 (Block-Size Row Pruning) Let p be a pattern
such that B = (p, T|p) does not satisfy the block-size
constraint. Then the smallest valid extension of p for the
block-size constraint of Equation 1 is

$$\mathrm{SVE}(p) \;\ge\; \frac{\theta N - \mathrm{BSize}(B)}{|T|_p|}. \tag{9}$$


Proof. Let ζ be the maximum row-sum over all the rows
of M|p; we can then rewrite the dot product 2D · B_x as


Proof. We will show that it holds by contradiction.
Assume that there is a set of items γ such that the block
B' = (p ∪ γ, T|p∪γ) is valid and

$$|\gamma| \;<\; \frac{\theta N - \mathrm{BSize}(B)}{|T|_p|}. \tag{10}$$

Because |T|p∪γ| ≤ |T|p|, we have that

$$\mathrm{BSize}(B') = |p \cup \gamma| \times |T|_{p \cup \gamma}| \;\le\; \mathrm{BSize}(B) + |\gamma| \times |T|_p|. \tag{11}$$

Combining Equation 10 and Equation 11 we have

$$\mathrm{BSize}(B') \;<\; \theta N,$$

indicating that B' does not satisfy the block-size constraint,
violating our initial assumption about B'. This means that p
needs to be extended by adding at least (θN − BSize(B))/|T|p|
items. □


The complexity of computing SVE(p) is O(1).
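The bound of Lemma 5 and the resulting row pruning can be sketched as follows. The helper names are hypothetical, rows are assumed to be sets of local column ids, and `theta_n` is the product θN. Rounding the bound up to the next integer is an assumption that follows from extensions having integer length.

```python
import math

def sve_block_size(p_len, n_rows, theta_n):
    """Lower bound on the number of columns that must be added to p
    before the block can satisfy the block-size constraint."""
    if n_rows == 0:
        return math.inf                      # no supporting rows: no valid extension
    bsize = p_len * n_rows                   # BSize(B) = |p| x |T|p|
    if bsize >= theta_n:
        return 0                             # already valid: row pruning is not applied
    return math.ceil((theta_n - bsize) / n_rows)

def prune_rows(rows, sve):
    """Keep only the rows that are long enough to hold a valid extension."""
    return [r for r in rows if len(r) >= sve]
```

For the prefix p = {g} of Figure 1 (two supporting rows of length 3), a hypothetical θN = 9 gives a bound of 4, so both rows are pruned, whereas θN = 8 gives a bound of 3 and both rows are kept.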

6.2.2  Block  Sum 

The SVE of a pattern p for the block-sum constraint is
given by the following lemma.

Lemma 6 (Block-Sum Row Pruning) Let p be a pattern
such that B = (p, T|p) does not satisfy the block-sum
constraint, and z be the maximum column-sum over all columns
of M|p. Then the smallest valid extension of p for the
block-sum constraint of Equation 2 is

$$\mathrm{SVE}(p) \;\ge\; \frac{\theta W - \mathrm{BSum}(B)}{z}. \tag{12}$$


Proof. As with Lemma 5, we will show that it holds by
contradiction. Assume that there is a set of items γ such
that the block B' = (p ∪ γ, T|p∪γ) is valid and

$$|\gamma| \;<\; \frac{\theta W - \mathrm{BSum}(B)}{z}. \tag{13}$$

Because |T|p∪γ| ≤ |T|p|, each column of γ contributes
no more than z to the block-sum of B', and we have that

$$\mathrm{BSum}(B') \;\le\; \mathrm{BSum}(B) + |\gamma| \times z. \tag{14}$$

Combining Equation 13 and Equation 14 we have that

$$\mathrm{BSum}(B') \;<\; \theta W,$$

indicating that B' does not satisfy the block-sum constraint,
violating our initial assumption about B'. This means that p
needs to be extended by adding at least (θW − BSum(B))/z
items. □


The complexity of computing SVE(p) is of the order of
the size of the projected matrix, as we need one scan of the
projected matrix to compute the maximum of the
column-sums.
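The bound of Lemma 6 needs the maximum column-sum z, which the sketch below obtains in one pass over the projected matrix. As before, the representation (rows as dictionaries from column id to non-negative weight) and the function name are assumptions made for illustration; `bsum_b` stands for BSum(B) and `theta_w` for θW.

```python
import math

def sve_block_sum(bsum_b, rows, theta_w):
    """Lower bound on the number of columns that must be added to p
    before the block can satisfy the block-sum constraint."""
    if bsum_b >= theta_w:
        return 0                             # already valid
    col_sum = {}
    for r in rows:                           # one scan of the projected matrix
        for x, v in r.items():
            col_sum[x] = col_sum.get(x, 0.0) + v
    z = max(col_sum.values(), default=0.0)   # maximum column-sum
    if z <= 0.0:
        return math.inf                      # nothing left that could add weight
    return math.ceil((theta_w - bsum_b) / z)
```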


over all columns of M|p. Then the smallest valid extension
of p for the block-similarity constraint of Equation 3 is

$$\mathrm{SVE}(p) \;\ge\; \frac{\theta S - \mathrm{BSim}(B)}{z}. \tag{15}$$

The proof is similar to that used for Lemma 6. In
particular, each column of M|p can contribute at most z to the
overall block-similarity, and thus p needs to grow by adding
at least (θS − BSim(B))/z items. The actual details of the
proof are omitted. The complexity of computing SVE(p)
is identical to that for the block-sum constraint.

6.3  Matrix  Pruning 

Given a prefix itemset p and its projected matrix M|p, the
column pruning and row pruning methods are very effective
in pruning unpromising columns and rows from M|p.
However, in many cases the whole projected matrix M|p
cannot be used to generate any valid block patterns and
thus can be pruned. Hence we developed another class of
pruning methods, called matrix pruning, in order to further
prune the search space for the block-size, block-sum,
and block-similarity constraints.

6.3.1  Block  Size 

The necessary condition for the block-size constraint is
encapsulated in the following lemma.

Lemma 8 (Block-Size Matrix Pruning) Let p be a
pattern and t a transaction in M|p. Then in order for M|p to
be used to generate any valid block that satisfies the
block-size constraint of Equation 1 and is obtained by extending p
with some columns in M|p, the following must hold:

$$\mathrm{BSize}(p, T|_p) + \sum_{t \in T|_p} |t| \;\ge\; \theta N. \tag{16}$$

Proof. Let γ be any arbitrary set of columns in M|p
such that the block B = (p ∪ γ, T|p∪γ) satisfies the
block-size constraint, that is:

$$|p \cup \gamma| \times |T|_{p \cup \gamma}| \;\ge\; \theta \times N.$$

Because T|p∪γ ⊆ T|p and |t| ≥ |γ| for every t ∈ T|p∪γ, the
following holds:

$$\sum_{t \in T|_p} (|p| + |t|) \;\ge\; \sum_{t \in T|_{p \cup \gamma}} (|p| + |t|) \;\ge\; |p \cup \gamma| \times |T|_{p \cup \gamma}|.$$

Equation 16 can be obtained by combining the above two
inequalities and using the fact that BSize(p, T|p) = |p| ×
|T|p|. □

The sums in Equation 16 can be computed by a single
scan of the p-projected matrix M|p.
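Equation 16 translates into a one-line test over the projected matrix. The sketch below assumes rows are sets of local column ids and that `theta_n` is the product θN; the function name is hypothetical.

```python
def can_prune_matrix_block_size(p_len, rows, theta_n):
    """True if even the whole of M|p cannot yield a block that
    satisfies the block-size constraint (Equation 16 is violated)."""
    upper_bound = p_len * len(rows) + sum(len(r) for r in rows)
    return upper_bound < theta_n
```

For the prefix p = {g} of Figure 1, the bound is 1 × 2 + (3 + 3) = 8, so the projected matrix would be pruned for any hypothetical θN above 8.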

6.3.2  Block  Sum 


6.2.3  Block  Similarity 

Let D be the composite vector of T and consider a
p-projected weighted matrix M|p. The column-similarity of
column x in M|p is denoted by csim_{M|p}(x) and is defined
to be equal to 2D(x) csum_{M|p}(x) − csum²_{M|p}(x). Given this
definition, the SVE of a pattern p for the block-similarity
constraint is given by the following lemma.

Lemma 7 (Block-Similarity Row Pruning) Let p be a
pattern such that B = (p, T|p) does not satisfy the
block-similarity constraint, and z be the maximum column-similarity


The necessary condition for the block-sum constraint is
stated in the following lemma.

Lemma 9 (Block-Sum Matrix Pruning) Let p be a
pattern, x a column in M|p, and t a transaction in M|p. Then
in order for M|p to be used to generate any valid block that
satisfies the block-sum constraint of Equation 2 and is
obtained by extending p with some columns in M|p, the
following must hold:

$$\mathrm{BSum}(p, T|_p) + \sum_{t \in T|_p,\; x \in I|_p} M|_p(t, x) \;\ge\; \theta W. \tag{17}$$
This lemma can be proved in a similar way to Lemma 8
and the actual proof is omitted. Note that the summation
on the left-hand side of Equation 17 is nothing more than
the sum of the non-zero elements of each row in T|p and can
be computed in one scan of the p-projected matrix.

6.3.3  Block  Similarity 

Using the definition of the column-similarity introduced in
Section 6.2.3, the necessary condition for the block-similarity
constraint can be stated as follows:

Lemma 10 (Block-Similarity Matrix Pruning) Let p be
a pattern, x a column of M|p, and csim_{M|p}(x) the
column-similarity of x in M|p. Then in order for M|p to be used
to generate any valid blocks that satisfy the block-similarity
constraint of Equation 3 and are obtained by extending p
with some columns in M|p, the following must hold:

$$\mathrm{BSim}(p, T|_p) + \sum_{x \in I|_p} \mathrm{csim}_{M|_p}(x) \;\ge\; \theta S. \tag{18}$$

The proof is similar to that used for Lemma 8. In
particular, each column x of M|p can contribute at most
csim_{M|p}(x) to the overall block-similarity, and thus the whole
matrix M|p contributes at most the sum of the
column-similarities of its columns. The actual details of the proof
are omitted. The column-similarities of all the columns can
be computed in a single scan of the p-projected matrix.
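The column-similarity and the test of Equation 18 can be sketched as follows. The representation is assumed for illustration: rows are dictionaries from column id to weight, `D` maps each column id to its entry in the composite vector, `bsim_p` stands for BSim(p, T|p), and `theta_s` for θS. The function names are hypothetical.

```python
def column_similarities(rows, D):
    """csim(x) = 2 * D(x) * csum(x) - csum(x)^2 for every column x of M|p,
    where csum(x) is the sum of the values in column x."""
    csum = {}
    for r in rows:                           # one scan of the projected matrix
        for x, v in r.items():
            csum[x] = csum.get(x, 0.0) + v
    return {x: 2.0 * D[x] * s - s * s for x, s in csum.items()}

def can_prune_matrix_block_sim(bsim_p, rows, D, theta_s):
    """True if Equation 18 is violated, so M|p cannot yield a valid block."""
    bound = bsim_p + sum(column_similarities(rows, D).values())
    return bound < theta_s
```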


Data           # Trans   # Items   Avg. (max.) trans. len.
gazelle          59601       498   2.5 (267)
BMSWebView2      77512      3340   4.6 (161)
pumsb*           49046      2089   50.5 (63)
big-market      838466     38336   3.12 (90)
Sports            8580    126373   258.3 (2344)
T40I10D100K     100000      1000   39.6 (77)

Table 1: Dataset Characteristics.


7.  EXPERIMENTS 

7.1  Test  Environment  and  Datasets 

In this section, we evaluate CBMiner for all three
constraints. All the experiments were performed on a 2GHz
Intel P4 processor with 2GB of memory running Linux.
CBMiner was implemented in C and all times reported
are in seconds. Since there are no existing closed
block-constraint mining algorithms, we chose one of the most
recently developed frequent closed itemset mining algorithms,
CLOSET+ [28], for our comparisons. We compared CBMiner
with a Linux executable version of CLOSET+, providing
the minimum frequency of the valid closed block
patterns generated by CBMiner as the absolute minimum
support threshold to CLOSET+. This ensures that CLOSET+
will discover all the patterns found by CBMiner. However,
CLOSET+ will also find additional patterns that do not
satisfy the various block constraints. We compared the
running times as well as the number of patterns generated by
CBMiner with those of CLOSET+. We also evaluated the
pruning methods that we proposed and their various
possible combinations. We used five real datasets and several
synthetic datasets to evaluate the algorithm performance,
the effectiveness of the pruning methods, and the scalability
in terms of the database size. The characteristics (number
of transactions, number of items, and the average (maximum)
transaction lengths) of the datasets are shown in Table 1.

Real datasets: The gazelle and BMSWebView2 datasets
contain click-stream data from Gazelle.com. The pumsb*
dataset contains census data, and the big-market dataset
contains the transaction information of a retail store. The sports
dataset is a document dataset obtained from San Jose
Mercury (TREC).

Synthetic datasets: The synthetic datasets were
generated with the IBM dataset generator, with average transaction
lengths of 40, 20, and 30 and frequent itemset lengths of 10, 10,
and 15, respectively. We used T40I10D100K for our comparisons
with CLOSET+, as it was used in many previous
performance studies, and the remaining two datasets, T20I10D100Kx
and T30I15D100Kx, for our scalability experiments.

7.2  Experimental  Results 

7.2.1  Comparison  with  CLOSET+

We performed numerous experiments to compare CBMiner
with CLOSET+. Figs. 2-9 show the comparison results
for the BSize constraint and datasets gazelle, big-market,
pumsb*, and T40I10D100K, while Figs. 10-11 show the
results for the BSum constraint and the sports dataset, and
Figs. 12-13 show the results for the BSim constraint and
the BMSWebView2 dataset. From these results we can
see that, in general, CBMiner is substantially faster than
CLOSET+. This is primarily due to the fact that, as
expected, CLOSET+ produces significantly more patterns
than CBMiner. For datasets with short
transactions like gazelle, BMSWebView2, and big-market,
CBMiner can be order(s) of magnitude faster than CLOSET+,
and finds order(s) of magnitude fewer patterns. For
the datasets with long transactions like pumsb*, sports, and
T40I10D100K, CLOSET+ is a little faster at high BSize and
BSum thresholds, but once the threshold is lowered,
the number of frequent closed itemsets increases
explosively (e.g., with a BSize/BSum threshold of 0.2%, CLOSET+
generates several orders of magnitude more patterns than
CBMiner). These results illustrate that the pruning
methods used by CBMiner are indeed effective in reducing the
overall search space, leading to substantial performance
improvements.


Fig. 2. # Patt. (gazelle).  Fig. 3. Runtime (gazelle).
Fig. 4. # Patt. (big-market).  Fig. 5. Runtime (big-market).
Fig. 6. # Patt. (pumsb*).  Fig. 7. Runtime (pumsb*).
Fig. 8. # Patt. (T40I10D100K).  Fig. 9. Runtime (T40I10D100K).
Fig. 10. # Patt. (sports).  Fig. 11. Runtime (sports).


We evaluated the effectiveness of the three newly
proposed pruning methods, Column Pruning (CP), Row
Pruning (RP), and Matrix Pruning (MP), through their various
combinations. Figs. 14-15 show the comparison results for
the BSize constraint and datasets gazelle and
BMSWebView2, Fig. 16 shows the results for the BSum constraint
and dataset pumsb*, while Fig. 17 shows the results for the
BSim constraint and dataset big-market. These results show
that the combination of all three pruning techniques
(denoted as CP+RP+MP) is always faster than each individual
pruning method, and that for the BSize and BSum constraints
the overall ranking of the pruning effectiveness among the
three methods is Column Pruning > Matrix Pruning >
Row Pruning, while for the BSim constraint Matrix Pruning
is more effective than Row Pruning and Column Pruning.
Note that if we do not apply any of these three pruning
methods (denoted as No-Pruning), CBMiner runs too
slowly, and for this reason we do not show the curves
corresponding to No-Pruning in Figs. 14-17.

7.2.3  Scalability  Study 

We used the IBM synthetic datasets T20I10D100K and
T30I15D100K for the scalability test of CBMiner. We
replicated each dataset 'x' times, where 'x' varies from 2 to 10. For
the T20I10D100Kx series of datasets, we fixed the BSize and
BSum thresholds at 0.02% and the BSim threshold at 0.20%. For
the T30I15D100Kx series of datasets, we fixed the BSize and
BSum thresholds at 0.05% and the BSim threshold at 0.20%.
From Figs. 18 and 19, we can see that CBMiner scales
linearly on all three constraints.

7.3  Application  -  Micro  Concept  Discovery 


Finally, we demonstrate an application of the three block
constraints in the context of document clustering by
showing that the blocks discovered by these constraints represent
sets of documents that have a great chance of belonging to
the same cluster and hence can be used to identify potential
cores of natural clusters in data as well as thematically
related words. For this application we chose two
document datasets, LA1 and Classic, in addition to
Sports. The LA1 dataset contains articles that appeared
in LA Times news, whereas the Classic dataset contains
abstracts of technical papers. Some of the characteristics of
these datasets are shown in Table 2. We scaled the document
vectors using the well known tf-idf scaling, normalized them
using the L2-norm, and ran our closed block mining algorithm
with the block-size, block-sum, and block-similarity constraints.
From the patterns that were found we chose the 1000
highest ranked patterns on the basis of the constraint value.
For example, for the block-sum constraint, we selected the
top-1000 blocks ranked on block sum, and similarly
for the block-size and block-similarity constraints. For each of
the top-1000 blocks we computed the entropy [34] of the
documents that formed the supporting set of the block and
took the average of the 1000 entropies. Similarly, we
computed the average block pattern frequency and average block


Data 

No.  of  documents 

No.  of  terms 

No.  of  classes 

Classic 

Sports 

LAI 

7089 

8580 

3204 

12009 

18324 

31472 

4 

7 

6 

Table  2:  Summary  of  document  datasets  used  for 
the  application. 
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Fig.  14.  Pruning  methods.  Fig.  15.  Pruning  methods. 
{gazelle).  ( BMSWebView2 ). 


pattern  length.  For  comparison  purposes,  we  also  used  the 
CLOSET+  algorithm  to  find  a  set  of  frequent  closed  item- 
sets  and  also  selected  the  1000  most  frequent  itemsets  dis¬ 
covered  by  CLOSET+.  Fig.  20  shows  the  average  entropy, 
frequency,  and  length  of  the  various  patterns  discovered  by 
the  four  algorithms  for  the  three  datasets.  Note  that  the 
CLOSET+  results  are  labeled  as  “freq”. 
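The evaluation procedure described above (rank the blocks by constraint value, keep the top k, and average the entropy of their supporting documents) can be sketched as follows. This is only an illustrative sketch: the function names and the input layout are hypothetical, and plain base-2 Shannon entropy over class labels is an assumption, since the exact normalization used in [34] is not restated here.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Base-2 Shannon entropy of the class labels of one block's
    supporting documents (assumed form; [34] may normalize differently)."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def average_topk_entropy(blocks, k):
    """blocks: list of (constraint_value, labels) pairs, where labels are
    the class labels of the documents supporting the block (hypothetical
    layout). Returns the mean entropy of the k highest ranked blocks."""
    top = sorted(blocks, key=lambda b: b[0], reverse=True)[:k]
    if not top:
        raise ValueError("no blocks to evaluate")
    return sum(class_entropy(labels) for _, labels in top) / len(top)


# Toy example: three blocks with made-up constraint values and labels.
blocks = [
    (5.0, ["sports", "sports", "sports"]),   # pure block, entropy 0
    (3.0, ["sports", "finance"]),            # evenly mixed, entropy 1
    (1.0, ["a", "b", "c", "d"]),             # ranked last, ignored for k=2
]
avg = average_topk_entropy(blocks, k=2)
```

A lower average indicates that the supporting documents of the top-ranked blocks come mostly from a single class.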

From these results we can see that the average entropy
of the patterns discovered by the four schemes is quite
small, indicating that all of them do reasonably well in
identifying itemsets whose supporting documents are primarily
from a single class. Nevertheless, the block-similarity
constraint outperforms the rest, as it leads to the lowest
entropies (i.e., the purest clusters) for all datasets. This
supports our initial motivation for defining the
block-similarity constraint: it better captures the
characteristics of the underlying datasets and problem, and
discovers sets of words that are thematically closely related.
The block-size and itemset-support constraints are less
consistent in finding good concepts because they do not
account for the weights associated with the terms in the
document-term matrices. The block-sum constraint, on the other
hand, does reasonably well because it takes into account the
differences in the term weights produced by the tf-idf scaling
and L2 normalization of the document vectors. Also note that
the highest ranked patterns discovered by the frequent closed
itemset mining algorithm (CLOSET+) are in general quite short
compared to the patterns discovered under the block
constraints.
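To make the role of the term weights concrete, the tf-idf scaling and L2 normalization applied to the document vectors can be sketched as below. This is a minimal sketch under stated assumptions: the function name and the dictionary-based document representation are hypothetical, and the weighting tf * log(N / df) is one common tf-idf variant that may differ from the exact formula used in the experiments.

```python
import math


def tfidf_l2(docs):
    """docs: list of dicts mapping term -> raw term frequency
    (hypothetical layout). Returns tf-idf weighted vectors scaled to
    unit L2 norm, using the assumed variant tf * log(N / df)."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = {}
    for doc in docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    scaled = []
    for doc in docs:
        vec = {t: tf * math.log(n / df[t]) for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Leave all-zero vectors as zeros rather than dividing by zero.
        scaled.append({t: (w / norm if norm > 0 else 0.0)
                       for t, w in vec.items()})
    return scaled


# Toy corpus: "game" occurs in every document, so its idf is zero.
docs = [
    {"ball": 2, "game": 1},
    {"game": 1, "proof": 3},
]
vectors = tfidf_l2(docs)
```

Under this weighting, a term that appears in every document contributes nothing to a block, whereas a purely binary view of the matrix would count it like any other term.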

Fig. 18. Scalability (T20I10D100Kx).  Fig. 19. Scalability (T30I15D100Kx).

8. CONCLUSION AND FUTURE WORK


In this paper we studied how to mine valid closed patterns
with tough block constraints and proposed a matrix-projection
based framework called CBMiner for mining closed block
patterns in transaction-item or document-term matrices
effectively. Under this framework we mainly discussed three
typical block constraints, namely block size, block sum, and
block similarity. Because these block constraints are tough,
some widely adopted properties derived from anti-monotone or
monotone constraints no longer hold and cannot be used to
prune the search space. As a result, we proposed three novel
pruning methods, column pruning, row pruning, and matrix
pruning, which push the block constraints deeply into pattern
discovery and effectively prune the unpromising columns, rows,
and projected matrices. Moreover, our experimental results
show that CBMiner finds far fewer patterns and runs order(s)
of magnitude faster than the traditional frequent closed
pattern mining algorithms.